CS 674/INFO 630: Advanced Language Technologies, Lecture 7 (September 18). 2 Incorporating Term Frequencies
Abstract
Besides IDF, term frequencies are also important, and we would like to incorporate them into our scoring function. From now on, we treat $A_j$ as a random variable denoting the number of occurrences of term $j$ in a document. What, then, should $P(A_j = a)$ and $P(A_j = a \mid R_q = y)$ be? In other words, how do we model the distributions of these random variables? We have two options: continuous or discrete distributions. A discrete distribution seems more natural because we are dealing with word counts, so the remaining question is which kind of discrete distribution to pick. Some natural options include:
Similar resources
CS 674/INFO 630: Advanced Language Technologies
$P_{\vec{\theta}} : V \mapsto [0, 1]$, where $\vec{\theta}$ is an element of the $m$-dimensional probability simplex. Hence the probability assigned to a single term $v_j$ is defined as $P_{\vec{\theta}}(v_j) \stackrel{\text{def}}{=} \theta[j]$. Also recall from the previous lecture that the Kullback–Leibler (KL) divergence between two probability distributions $P_{\vec{\theta}}$ and $P_{\vec{\theta}'}$, i.e. the expected log-likelihood ratio with respect to $P_{\vec{\theta}}$, is defined as: $D(P_{\vec{\theta}} \,\|\, P_{\vec{\theta}'})$ ...
CS 674 / INFO 630: Advanced Language Technologies, Fall 2007
At the end of the previous lecture we were discussing how to incorporate implicit relevance feedback that comes in the form of preferences: instead of absolute judgments (this document is relevant and that document is not), we had information from clickthrough data in the form of relative judgments (this document is more relevant than that document). We ended up with some sort of vector ...
INFO 630 / CS 674 Lecture Notes
Today's lecture notes cover an introduction to the application of statistical language modeling to information retrieval as motivated by "The Language Modeling Approach to Information Retrieval" by Ponte and Croft from SIGIR '98. Language modeling is the 3rd major paradigm that we will cover in information retrieval. At the time of application, statistical language modeling had been used succes...
CS 6740: Advanced Language Technologies, February 4, 2010. Lecture 3: Pivoted Document Length Normalization
In this lecture, we examine the impact of the length of a document on its relevance to queries. We show that document relevance is positively correlated with document length, and see that relevance scores that use the normalization techniques we’ve studied so far (L∞, L1, L2) do not capture this correlation correctly. Finally, we present the “pivoted document length normalization” technique int...
Advanced Parallel Processing Technologies - 9th International Symposium, APPT 2011, Shanghai, China, September 26-27, 2011. Proceedings
Publication date: 2007